Models for Distributed, Large Scale Data Cleaning
نویسندگان
چکیده
Poor data quality is a serious and costly problem affecting organizations across all industries. Real data is often dirty, containing missing, erroneous, incomplete, and duplicate values. Declarative data cleaning techniques have been proposed to resolve some of these underlying errors by identifying the inconsistencies and proposing updates to the data. However, much of this work has focused on cleaning data in static environments. Given the Big Data era, modern applications are operating in dynamic data environments where large scale data may be frequently changing. For example, consider data in sensor environments where there is a frequent stream of data arrivals, or financial data of stock prices and trading volumes. Data cleaning in such dynamic environments requires understanding the properties of the incoming data streams, and configuration of system parameters to maximize performance and improved data quality. In this paper, we present a set of queueing models, and analyze the impact of various system parameters on the output quality of a data cleaning system and its performance. We assume random routing in our models, and consider a variety of system configurations that reflect potential data cleaning scenarios. We present experimental results showing that our models are able to closely predict expected system behaviour.
منابع مشابه
A Practical Desalinization Model for Large Scale Application
Salinity of soil and water is the most important agricultural hazard in arid and semi-aridregions. In saline soils, yield production directly influences by soluble salts in the root zone aswell as by shallow water table depth. The first step for reclamation of such soils is reducingsalinity to optimum level by leaching. The objective of this study was to develop a practicalmodel to estimate wat...
متن کاملAn Efficient Data Replication Strategy in Large-Scale Data Grid Environments Based on Availability and Popularity
The data grid technology, which uses the scale of the Internet to solve storage limitation for the huge amount of data, has become one of the hot research topics. Recently, data replication strategies have been widely employed in distributed environment to copy frequently accessed data in suitable sites. The primary purposes are shortening distance of file transmission and achieving files from ...
متن کاملA TWO-STAGE DAMAGE DETECTION METHOD FOR LARGE-SCALE STRUCTURES BY KINETIC AND MODAL STRAIN ENERGIES USING HEURISTIC PARTICLE SWARM OPTIMIZATION
In this study, an approach for damage detection of large-scale structures is developed by employing kinetic and modal strain energies and also Heuristic Particle Swarm Optimization (HPSO) algorithm. Kinetic strain energy is employed to determine the location of structural damages. After determining the suspected damage locations, the severity of damages is obtained based on variations of modal ...
متن کاملLarge-scale Inversion of Magnetic Data Using Golub-Kahan Bidiagonalization with Truncated Generalized Cross Validation for Regularization Parameter Estimation
In this paper a fast method for large-scale sparse inversion of magnetic data is considered. The L1-norm stabilizer is used to generate models with sharp and distinct interfaces. To deal with the non-linearity introduced by the L1-norm, a model-space iteratively reweighted least squares algorithm is used. The original model matrix is factorized using the Golub-Kahan bidiagonalization that proje...
متن کاملAccess control in ultra-large-scale systems using a data-centric middleware
The primary characteristic of an Ultra-Large-Scale (ULS) system is ultra-large size on any related dimension. A ULS system is generally considered as a system-of-systems with heterogeneous nodes and autonomous domains. As the size of a system-of-systems grows, and interoperability demand between sub-systems is increased, achieving more scalable and dynamic access control system becomes an im...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2014